Error Recovery in Critical Infrastructure Systems
Abstract
Critical infrastructure applications provide services upon which society depends heavily; such applications require survivability in the face of faults that might cause a loss of service. These applications are themselves dependent on distributed information systems for all aspects of their operation and so survivability of the information systems is an important issue. Fault tolerance is a key mechanism by which survivability can be achieved in these information systems. Much of the literature on fault-tolerant distributed systems focuses on local error recovery by masking the effects of faults. We describe a direction for error recovery in the face of catastrophic faults where the effects of the faults cannot be masked using available resources. The goal is to provide continued service that is either an alternate or degraded service by reconfiguring the system rather than masking faults. We outline the requirements for a reconfigurable system architecture and present an error recovery system that enables systematic structuring of error recovery specifications and implementations.
Similar resources
Error Recovery in Critical Infrastructure Systems - Computer Security, Dependability and Assurance: From Needs to Solutions, 1998. Proceedings
Critical infrastructure applications provide services upon which society depends heavily; such applications require survivability in the face of faults that might cause a loss of service. These applications are themselves dependent on distributed information systems for all aspects of their operation and so survivability of the information systems is an important issue. Fault tolerance is a key ...
Efficient Infrastructure Restoration Strategies Using the Recovery Operator
Infrastructure systems are critical for society’s resilience, government operation, and overall defense. Thereby, it is imperative to develop informative and computationally efficient analysis methods for infrastructure systems, which reveal system vulnerabilities and recoverability. To capture practical constraints in systems analyses, various layers of complexity play a role, including limite...
Proposing an Efficient Software-Based Method for Enhancing the Reliability of Critical Application Robot
Robots play such remarkable roles in humans’ modern lives that performing many tasks without them is impossible. Using robotic systems is gradually increasing the tasks allocated to them and they are becoming more complex and critical. Software reliability is one of the most significant requirements of robots. For enhancing reliability, systems should be inherently designed to be tolerable of soft...
Watchdog Processor-Assisted Fast Recovery in Distributed Systems
A major concern in implementing a checkpoint-based recovery protocol for distributed systems is the performance degradation resulting from process roll-backs. In critical systems, it is highly desirable to contain the rollback distance as well as the number of processes involved in the rollback so that timely recovery is possible. One popular approach to accomplish such goals is to control the ...
Hierarchical Error Detection and Recovery in a Software Implemented Fault Tolerance (SIFT) Environment
A key issue in the design of reliable distributed systems is how to make the entities that provide the reliability properties of the system, themselves failure resilient. An application executing in such a system is dependent on these entities and hence, it is critical to protect not just the application, but also the components of the fault tolerance layer, through a variety of error detection...